Parameterizable benchmarking framework for designing a MapReduce performance model

Authors

  • Zhuoyao Zhang
  • Ludmila Cherkasova
  • Boon Thau Loo
Abstract

In MapReduce environments, many applications have to meet different performance goals in order to produce time-relevant results. A typical user question is how to estimate the completion time of a MapReduce program as a function of varying input dataset sizes and given cluster resources. In this work, we offer a novel performance evaluation framework for answering this question. We analyze the MapReduce processing pipeline and utilize the fact that the execution of map (reduce) tasks consists of specific, well-defined data processing phases. Only the map and reduce functions are custom: their executions are user-defined and differ between MapReduce jobs. The executions of the remaining phases are generic (i.e., defined by the MapReduce framework code) and depend on the amount of data processed by the phase and on the performance of the underlying Hadoop cluster. First, we design a set of parameterizable microbenchmarks to profile the execution of the generic phases and to derive a platform performance model of a given Hadoop cluster. Then, using the job's past executions, we summarize the job's properties and the performance of its custom map/reduce functions in a compact job profile. Finally, by combining the job profile with the derived platform performance model, we introduce a MapReduce performance model that estimates the program completion time for processing a new dataset. The proposed benchmarking approach derives an accurate performance model of Hadoop's generic execution phases once, and this model is then reused for predicting the performance of different applications. The evaluation study justifies our approach and the proposed framework: we use a diverse suite of 12 MapReduce applications to validate the proposed model. On our 66-node Hadoop cluster, the predicted completion times for most experiments are within 10% of the measured ones, with a worst-case error of 17%. Copyright © 2014 John Wiley & Sons, Ltd.
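
As a rough illustration of the approach described in the abstract, the following Python sketch fits a per-phase platform model from microbenchmark samples and combines it with a job profile to estimate a task duration. The abstract does not state the functional form of the platform model, so the linear fit used here is an assumption, as are the phase names (`read`, `spill`), the sample numbers, and the naive wave-based job estimate at the end. None of these are taken from the paper.

```python
import math
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def build_platform_model(bench):
    """Fit one line per generic phase.

    bench maps a phase name to a list of (data_mb, seconds) samples
    collected by microbenchmarks. Assumption: duration is linear in data size.
    """
    return {phase: fit_line(*zip(*samples)) for phase, samples in bench.items()}


def estimate_task_time(platform, data_mb_per_phase, custom_seconds):
    """Generic phase times (from the platform model) plus the custom
    map/reduce function time (from the job profile)."""
    generic = sum(a + b * data_mb_per_phase[p] for p, (a, b) in platform.items())
    return generic + custom_seconds


def estimate_job_time(task_seconds, num_tasks, num_slots):
    """Naive estimate (assumption): tasks run in equal-length waves."""
    return math.ceil(num_tasks / num_slots) * task_seconds


if __name__ == "__main__":
    # Hypothetical microbenchmark samples: (data in MB, duration in seconds).
    bench = {
        "read": [(10, 1.0), (20, 2.0), (30, 3.0)],
        "spill": [(10, 2.5), (20, 4.5), (30, 6.5)],
    }
    platform = build_platform_model(bench)
    task = estimate_task_time(platform, {"read": 64, "spill": 64}, custom_seconds=3.0)
    print(task, estimate_job_time(task, num_tasks=100, num_slots=40))
```

The platform model is built once per cluster, while the data amounts and `custom_seconds` come from each job's profile, which mirrors the reuse across applications that the abstract describes.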

Similar resources

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop and the existing Hadoop distributed file system's rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...

Benchmarking and Performance studies of MapReduce / Hadoop Framework on Blue Waters Supercomputer

MapReduce is an emerging and widely used programming model for large-scale data-parallel applications that need to process large amounts of raw data. There are several implementations of the MapReduce framework, among which Apache Hadoop is the most commonly used open-source implementation. These frameworks are rarely deployed on supercomputers as massive as Blue Waters. We want to evaluate ho...

Profiling and evaluating hardware choices for MapReduce environments: An application-aware approach

The core business of many companies depends on the timely analysis of large quantities of new data. MapReduce clusters that routinely process petabytes of data represent a new entity in the evolving landscape of clouds and data centers. During the lifetime of a data center, old hardware needs to be eventually replaced by new hardware. The hardware selection process needs to be driven by perform...

On-the-Fly Task Execution for Speeding Up Pipelined MapReduce

The MapReduce programming model is widely acclaimed as a key solution to designing data-intensive applications. However, many of the computations that fit this model cannot be expressed as a single MapReduce execution, but require a more complex design. Such applications consisting of multiple jobs chained into a long-running execution are called pipeline MapReduce applications. Standard MapRed...

Evaluating Information Technology with an Integrated Approach

The IT value measurement model is proposed to evaluate the business value of IT. This model utilizes several different tools and techniques, such as benchmarking, the Balanced Scorecard, and qualitative and quantitative measurement techniques. The basis of this model is a link between 3-layer IT classification and IT planning with a combinatorial approach. Connecting these three layers with effectiveness an...

Journal title:
  • Concurrency and Computation: Practice and Experience

Volume 26  Issue 

Pages  -

Publication date 2014